In this notebook, we will explore the underlying ideas of developing predictive models. We will do this using fundamental R code to show the idea (first principles from a coding perspective).
We will focus on a continuous outcome context (regression), using simple linear regression as our learning model.
The data will be simulated each time to make the data generating process very clear.
Let’s start with a simple data generating process. I will simulate two predictor variables (X1 and X2), and then simulate the outcome variable (Y). Y is independent of X1 and X2 in this first simulation.
# load data.table, used for all data handling below
library(data.table)
# simulate some of todays data, continuous
set.seed(42)
n_obs <- 10000
# simulate x1 (continuous)
x1 <- rlnorm(n = n_obs,
meanlog = log(4),
sdlog = log(1.2))
# simulate x2 (binary)
x2 <- rbinom(n = n_obs,
size = 1,
prob = 0.3)
# simulate y
y_mean <- 8
y_sd <- 3
y <- rnorm(n = n_obs,
mean = y_mean,
sd = y_sd)
# make dataframe
df1 <- data.table(y,
x1,
x2)
# check it out
df1
Recall our goal in machine learning is to use data we can see (today's data) to inform a learning algorithm that does a good job of predicting tomorrow's data.
We typically cannot observe tomorrow's data, but we will simulate it here using the same data generating process (DGP) to make clear exactly what we are doing.
# simulate some of tomorrows data, continuous
set.seed(68)
n_obs <- 2000
# simulate x1 (continuous)
x1 <- rlnorm(n = n_obs,
meanlog = log(4),
sdlog = log(1.2))
# simulate x2 (binary)
x2 <- rbinom(n = n_obs,
size = 1,
prob = 0.3)
# simulate y
y_mean <- 8
y_sd <- 3
y <- rnorm(n = n_obs,
mean = y_mean,
sd = y_sd)
# make dataframe
df1_future <- data.table(y,
x1,
x2)
# check it out
df1_future
Let’s examine the distribution of Y. Since it is independent of X1 and X2, we will just show a histogram.
# checkout the data
df1[,hist(y,breaks = 50)]
## $breaks
## [1] -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5
## [16] 4.0 4.5 5.0 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0
## [31] 11.5 12.0 12.5 13.0 13.5 14.0 14.5 15.0 15.5 16.0 16.5 17.0 17.5 18.0 18.5
## [46] 19.0 19.5 20.0 20.5
##
## $counts
## [1] 1 3 3 4 13 10 17 19 31 53 87 108 133 186 250 287 403 443 515
## [20] 506 565 636 640 703 629 585 542 497 456 392 309 267 219 129 122 83 50 40
## [39] 29 11 11 4 6 1 1 0 0 1
##
## $density
## [1] 0.0002 0.0006 0.0006 0.0008 0.0026 0.0020 0.0034 0.0038 0.0062 0.0106
## [11] 0.0174 0.0216 0.0266 0.0372 0.0500 0.0574 0.0806 0.0886 0.1030 0.1012
## [21] 0.1130 0.1272 0.1280 0.1406 0.1258 0.1170 0.1084 0.0994 0.0912 0.0784
## [31] 0.0618 0.0534 0.0438 0.0258 0.0244 0.0166 0.0100 0.0080 0.0058 0.0022
## [41] 0.0022 0.0008 0.0012 0.0002 0.0002 0.0000 0.0000 0.0002
##
## $mids
## [1] -3.25 -2.75 -2.25 -1.75 -1.25 -0.75 -0.25 0.25 0.75 1.25 1.75 2.25
## [13] 2.75 3.25 3.75 4.25 4.75 5.25 5.75 6.25 6.75 7.25 7.75 8.25
## [25] 8.75 9.25 9.75 10.25 10.75 11.25 11.75 12.25 12.75 13.25 13.75 14.25
## [37] 14.75 15.25 15.75 16.25 16.75 17.25 17.75 18.25 18.75 19.25 19.75 20.25
##
## $xname
## [1] "y"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
As expected, the future data looks just like the current data since it is from the same data generating process.
# checkout the data
df1_future[,hist(y,breaks = 50)]
## $breaks
## [1] -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5
## [16] 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 11.5 12.0 12.5 13.0
## [31] 13.5 14.0 14.5 15.0 15.5 16.0 16.5 17.0 17.5
##
## $counts
## [1] 2 2 4 4 6 10 9 22 30 44 49 50 76 109 105 114 114 109 148
## [20] 125 132 136 102 102 89 79 55 52 34 20 23 20 10 4 4 2 0 4
##
## $density
## [1] 0.002 0.002 0.004 0.004 0.006 0.010 0.009 0.022 0.030 0.044 0.049 0.050
## [13] 0.076 0.109 0.105 0.114 0.114 0.109 0.148 0.125 0.132 0.136 0.102 0.102
## [25] 0.089 0.079 0.055 0.052 0.034 0.020 0.023 0.020 0.010 0.004 0.004 0.002
## [37] 0.000 0.004
##
## $mids
## [1] -1.25 -0.75 -0.25 0.25 0.75 1.25 1.75 2.25 2.75 3.25 3.75 4.25
## [13] 4.75 5.25 5.75 6.25 6.75 7.25 7.75 8.25 8.75 9.25 9.75 10.25
## [25] 10.75 11.25 11.75 12.25 12.75 13.25 13.75 14.25 14.75 15.25 15.75 16.25
## [37] 16.75 17.25
##
## $xname
## [1] "y"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
Now we need to come up with a learning algorithm that gives the best prediction for the outcome variable Y. What do you think are some good potential simple learning algorithms for this prediction? Recall that the goal here is to simply get the best prediction for the outcome variable.
We could try the mean:
# what is the 'best' prediction for y?
# try the mean
df1[,mean(y)]
## [1] 8.03543
Or we could try the median:
# what is the 'best' prediction for y?
# try the median
df1[,median(y)]
## [1] 8.062533
Or anything else you think is reasonable!
Let’s stick with the mean.
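As a side note (a small illustration of ours, with made-up numbers): the choice between mean and median is really a choice about which errors we care about, since the mean minimizes the average squared error while the median minimizes the average absolute error:

```r
# toy values to compare the mean and median as constant predictions
vals <- c(1, 2, 3, 10)
mean((vals - mean(vals))^2)    # squared error is smallest at the mean: 12.5
mean((vals - median(vals))^2)  # squared error at the median is larger: 14.75
mean(abs(vals - mean(vals)))   # absolute error at the mean is larger: 3
mean(abs(vals - median(vals))) # absolute error is smallest at the median: 2.5
```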
Our next step is to make predictions using our learning algorithm:
# decide on a 'learning algorithm' - mean
# make predictions for todays data
df1[,y_pred := mean(y)]
How good is our prediction? How do we know if we have a ‘good’ learning algorithm?
We can look at the distribution of the training errors (errors from today's data) like so:
# calculate training errors
df1[,y_pred_error := y - y_pred]
# check out errors
df1[,hist(y_pred_error,breaks=50)]
## $breaks
## [1] -11.5 -11.0 -10.5 -10.0 -9.5 -9.0 -8.5 -8.0 -7.5 -7.0 -6.5 -6.0
## [13] -5.5 -5.0 -4.5 -4.0 -3.5 -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0
## [25] 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0
## [37] 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0 10.5 11.0 11.5 12.0
## [49] 12.5
##
## $counts
## [1] 1 3 3 4 13 11 16 20 34 54 90 104 139 194 250 293 403 460 508
## [20] 501 581 626 658 699 627 591 526 495 457 372 312 271 206 128 122 81 44 42
## [39] 28 9 11 5 5 1 1 0 0 1
##
## $density
## [1] 0.0002 0.0006 0.0006 0.0008 0.0026 0.0022 0.0032 0.0040 0.0068 0.0108
## [11] 0.0180 0.0208 0.0278 0.0388 0.0500 0.0586 0.0806 0.0920 0.1016 0.1002
## [21] 0.1162 0.1252 0.1316 0.1398 0.1254 0.1182 0.1052 0.0990 0.0914 0.0744
## [31] 0.0624 0.0542 0.0412 0.0256 0.0244 0.0162 0.0088 0.0084 0.0056 0.0018
## [41] 0.0022 0.0010 0.0010 0.0002 0.0002 0.0000 0.0000 0.0002
##
## $mids
## [1] -11.25 -10.75 -10.25 -9.75 -9.25 -8.75 -8.25 -7.75 -7.25 -6.75
## [11] -6.25 -5.75 -5.25 -4.75 -4.25 -3.75 -3.25 -2.75 -2.25 -1.75
## [21] -1.25 -0.75 -0.25 0.25 0.75 1.25 1.75 2.25 2.75 3.25
## [31] 3.75 4.25 4.75 5.25 5.75 6.25 6.75 7.25 7.75 8.25
## [41] 8.75 9.25 9.75 10.25 10.75 11.25 11.75 12.25
##
## $xname
## [1] "y_pred_error"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
Is that good? How do we know?
Was our goal to devise a learning algorithm that is good at predicting the training data?
No - we want a learning algorithm that is good at predicting the future data!
Let's identify the prediction errors on the future data.
First we make the prediction:
# make predictions for tomorrows data, using todays data
df1_future[,y_pred := mean(df1$y)]
Now we calculate the errors and examine them. Is a histogram a good way to summarize the errors?
# calculate generalization errors
df1_future[,y_pred_error := y - y_pred]
# check out errors
df1_future[,hist(y_pred_error,breaks=50)]
## $breaks
## [1] -9.5 -9.0 -8.5 -8.0 -7.5 -7.0 -6.5 -6.0 -5.5 -5.0 -4.5 -4.0 -3.5 -3.0 -2.5
## [16] -2.0 -1.5 -1.0 -0.5 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0
## [31] 5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5
##
## $counts
## [1] 2 2 4 4 6 10 9 26 27 44 51 51 78 113 104 114 113 115 143
## [20] 128 133 135 105 94 92 75 56 49 32 21 20 21 9 4 5 1 0 4
##
## $density
## [1] 0.002 0.002 0.004 0.004 0.006 0.010 0.009 0.026 0.027 0.044 0.051 0.051
## [13] 0.078 0.113 0.104 0.114 0.113 0.115 0.143 0.128 0.133 0.135 0.105 0.094
## [25] 0.092 0.075 0.056 0.049 0.032 0.021 0.020 0.021 0.009 0.004 0.005 0.001
## [37] 0.000 0.004
##
## $mids
## [1] -9.25 -8.75 -8.25 -7.75 -7.25 -6.75 -6.25 -5.75 -5.25 -4.75 -4.25 -3.75
## [13] -3.25 -2.75 -2.25 -1.75 -1.25 -0.75 -0.25 0.25 0.75 1.25 1.75 2.25
## [25] 2.75 3.25 3.75 4.25 4.75 5.25 5.75 6.25 6.75 7.25 7.75 8.25
## [37] 8.75 9.25
##
## $xname
## [1] "y_pred_error"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
How should we summarize these prediction errors?
Let’s calculate the mean of this distribution:
# mean errors?
df1_future[,mean(y_pred_error)]
## [1] -0.0666947
Hmm - this number is close to zero. That is good, but it does not tell us much about the width of the distribution - we could have large errors, but as long as they are balanced the mean error is zero…
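A tiny illustration with made-up numbers (not the notebook's data) of how balanced errors hide their size in the mean:

```r
# large but balanced errors cancel out in the mean error
errors <- c(-5, 5, -5, 5)
mean(errors)       # 0: looks like a perfect predictor
mean(abs(errors))  # 5: the typical error is actually large
```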
Perhaps we can take the absolute value of the errors first, then calculate the mean:
# mean absolute errors
df1_future[,mean(abs(y_pred_error))]
## [1] 2.366212
Or alternatively we could square each error, then take the mean of the squared errors:
# mean squared error
df1_future[,mean(y_pred_error**2)]
## [1] 8.704831
What are the differences between these two approaches? How do they express our value system about the errors this model makes?
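One way to see the difference, using toy error vectors of our own choosing: squaring penalizes a single large error far more heavily than the absolute value does, so MSE expresses a stronger aversion to occasional big misses:

```r
# two error vectors with identical MAE but very different MSE
errors_moderate <- c(-2, -1, 1, 2)   # several moderate errors
errors_outlier  <- c(0, 0, 0, -6)    # one large error
mean(abs(errors_moderate))  # MAE = 1.5
mean(abs(errors_outlier))   # MAE = 1.5
mean(errors_moderate^2)     # MSE = 2.5
mean(errors_outlier^2)      # MSE = 9
```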
Now we will fit a linear model using the lm() function. We can condition this model on a linear combination of X1 and X2. However, we know that Y is independent of X1 and X2… so we would not expect this to improve our learning algorithm.
# should we condition on x1 and x2?
# try a linear model
mod1 <-
lm(y ~ x1 + x2,
data=df1)
Is this a good model?
Traditional summary:
# traditionally we look at summary
summary(mod1)
##
## Call:
## lm(formula = y ~ x1 + x2, data = df1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.2155 -2.0917 0.0212 2.0851 12.1892
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.10975 0.16810 48.243 <2e-16 ***
## x1 -0.02181 0.04050 -0.539 0.590
## x2 0.04727 0.06635 0.712 0.476
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.042 on 9997 degrees of freedom
## Multiple R-squared: 7.872e-05, Adjusted R-squared: -0.0001213
## F-statistic: 0.3935 on 2 and 9997 DF, p-value: 0.6747
Is this model better? We can compare it to the unconditional mean using this syntax:
# fit the null model
mod0 <-
lm(y ~ 1,
data=df1)
# compare these models
anova(mod0,mod1)
But recall we are doing machine learning…the question we need to ask is:
Does it make better predictions…?
Let’s calculate the mean absolute error (MAE) for both models:
# compare predictions on tomorrows data from these two models!
# compare MAE
# mod0 prediction MAE
df1_future[,mean(abs(y-predict(mod0,newdata=df1_future)))]
## [1] 2.366212
# mod1 prediction MAE
df1_future[,mean(abs(y-predict(mod1,newdata=df1_future)))]
## [1] 2.365753
The MAE is very similar.
We could also calculate the mean squared error:
# compare predictions on tomorrows data from these two models!
# compare MSE
# mod0 prediction MSE
df1_future[,mean((y-predict(mod0,newdata=df1_future))**2)]
## [1] 8.704831
# mod1 prediction MSE
df1_future[,mean((y-predict(mod1,newdata=df1_future))**2)]
## [1] 8.7023
The MSE is very similar.
As expected, the predictive performance on the out-of-sample future data is similar for these two methods under this data generating process. We have just tried two different learning algorithms: the unconditional mean and a linear model conditioned on X1 and X2.
Any questions about that process?
We will now look at a slightly more complex data generating process. Our outcome variable Y now depends on both X1 and X2. The data generating process is still straightforward, and will be well captured by a linear model.
# simulate some of todays data, continuous
set.seed(42)
n_obs <- 10000
# simulate x1 (continuous)
x1 <- rlnorm(n = n_obs,
meanlog = log(4),
sdlog = log(1.2))
# simulate x2 (binary)
x2 <- rbinom(n = n_obs,
size = 1,
prob = 0.3)
# simulate y, now conditional on X1 and X2
y_mean <-
8 +
2*scale(x1) +
-4*x2
y_sd <- 3
y <- rnorm(n = n_obs,
mean = y_mean,
sd = y_sd)
# make dataframe
df2 <- data.table(y,
x1,
x2)
# check it out
df2
As before, we will simulate the future out of sample data using the same data generating process. As a reminder, we typically do not observe this data. However, the goal of machine learning is to maximize predictive accuracy on this data that we cannot observe.
# simulate some of tomorrows data, continuous
set.seed(65)
n_obs <- 10000
# simulate x1 (continuous)
x1 <- rlnorm(n = n_obs,
meanlog = log(4),
sdlog = log(1.2))
# simulate x2 (binary)
x2 <- rbinom(n = n_obs,
size = 1,
prob = 0.3)
# simulate y, now conditional on X1 and X2
y_mean <-
8 +
2*scale(x1) +
-4*x2
y_sd <- 3
y <- rnorm(n = n_obs,
mean = y_mean,
sd = y_sd)
# make dataframe
df2_future <- data.table(y,
x1,
x2)
# check it out
df2_future
Let’s visualize the data generating process:
# visualize the relationship
library(ggplot2)
ggplot(df2_future,aes(x1,y,group=x2, color=x2)) +
geom_point(alpha=0.1) +
geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
We will now fit a series of linear models similar to what we just did, and compare their predictive accuracy on the future data set.
# try a linear model
mod0 <-
lm(y ~ 1,
data=df2)
# try a linear model conditional on x1 and x2
mod1 <-
lm(y ~ x1 + x2,
data=df2)
# try a linear model conditional on x1 and x2 with an interaction
mod2 <-
lm(y ~ x1 + x2 + x1:x2,
data=df2)
Which of these models performs the best?
Recall our primary interest…does it make better predictions…?
This is a summary of the training error:
# compare predictions on todays data from these two models!
# compare MAE
df2[,mean(abs(y-predict(mod0,newdata=df2)))]
## [1] 3.223155
df2[,mean(abs(y-predict(mod1,newdata=df2)))]
## [1] 2.429978
df2[,mean(abs(y-predict(mod2,newdata=df2)))]
## [1] 2.429928
This is a summary of the generalization error:
# compare predictions on tomorrows data from these two models!
# compare MAE
df2_future[,mean(abs(y-predict(mod0,newdata=df2_future)))]
## [1] 3.241975
df2_future[,mean(abs(y-predict(mod1,newdata=df2_future)))]
## [1] 2.403727
df2_future[,mean(abs(y-predict(mod2,newdata=df2_future)))]
## [1] 2.403904
The MAEs are not similar!
Model 1 is much better than model 0.
Models 1 and 2 are similar.
We can do the same comparison with the MSE.
Training error:
# compare predictions on todays data from these two models!
# compare MSE
df2[,mean((y-predict(mod0,newdata=df2))**2)]
## [1] 16.37438
df2[,mean((y-predict(mod1,newdata=df2))**2)]
## [1] 9.253781
df2[,mean((y-predict(mod2,newdata=df2))**2)]
## [1] 9.253617
Generalization error:
# compare predictions on tomorrows data from these two models!
# compare MSE
df2_future[,mean((y-predict(mod0,newdata=df2_future))**2)]
## [1] 16.51942
df2_future[,mean((y-predict(mod1,newdata=df2_future))**2)]
## [1] 8.991186
df2_future[,mean((y-predict(mod2,newdata=df2_future))**2)]
## [1] 8.992331
The MSEs are not similar!
Model 1 is much better than model 0.
Models 1 and 2 are similar.
What is the difference between these two models?
Model 1 has more capacity.
In this case, the data generating process was sufficiently complex that the increased capacity of model 1 gave us better predictive performance.
Model 2 has even more capacity than model 1 (there is an interaction term). But the performance did not improve…because the underlying data generating process was well captured by model 1.
In the first example, we did not get better predictive performance - because the underlying data generating process was well captured by a simple mean (no need to condition on X1 or X2).
I will now make the data generating process even more complex - then we will fit some models and summarize the performance.
In this DGP, there is now an interaction between X1 and X2.
# simulate some of todays data, continuous
set.seed(42)
n_obs <- 10000
# simulate x1 (continuous)
x1 <- rlnorm(n = n_obs,
meanlog = log(4),
sdlog = log(1.2))
# simulate x2 (binary)
x2 <- rbinom(n = n_obs,
size = 1,
prob = 0.3)
# simulate y, now conditional on X1 and X2
y_mean <-
8 +
-4*x2 +
2*scale(x1) +
-4*scale(x1)*x2
y_sd <- 3
y <- rnorm(n = n_obs,
mean = y_mean,
sd = y_sd)
# make dataframe
df3 <- data.table(y,
x1,
x2)
# check it out
df3
As before, we need to simulate tomorrow’s data as well:
# simulate some of tomorrows data, continuous
set.seed(65)
n_obs <- 2000
# simulate x1 (continuous)
x1 <- rlnorm(n = n_obs,
meanlog = log(4),
sdlog = log(1.2))
# simulate x2 (binary)
x2 <- rbinom(n = n_obs,
size = 1,
prob = 0.3)
# simulate y, now conditional on X1 and X2
y_mean <-
8 +
-4*x2 +
2*scale(x1) +
-4*scale(x1)*x2
y_sd <- 3
y <- rnorm(n = n_obs,
mean = y_mean,
sd = y_sd)
# make dataframe
df3_future <- data.table(y,
x1,
x2)
# check it out
df3_future
# visualize the relationship
library(ggplot2)
ggplot(df3,aes(x1,y,group=x2, color=x2)) +
geom_point(alpha=0.1) +
geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Fit the same 3 models:
# try a linear model
mod0 <-
lm(y ~ 1,
data=df3)
# try a linear model conditional on x1 and x2
mod1 <-
lm(y ~ x1 + x2,
data=df3)
# try a linear model conditional on x1 and x2 with an interaction
mod2 <-
lm(y ~ x1 + x2 + x1:x2,
data=df3)
Which of these models performs the best?
Recall our primary interest…does it make better predictions…?
This is a summary of the training error:
# compare predictions on todays data from these two models!
# compare MAE
df3[,mean(abs(y-predict(mod0,newdata=df3)))]
## [1] 3.233767
df3[,mean(abs(y-predict(mod1,newdata=df3)))]
## [1] 2.841716
df3[,mean(abs(y-predict(mod2,newdata=df3)))]
## [1] 2.429928
This is a summary of the generalization error:
# compare predictions on tomorrows data from these two models!
# compare MAE
df3_future[,mean(abs(y-predict(mod0,newdata=df3_future)))]
## [1] 3.272249
df3_future[,mean(abs(y-predict(mod1,newdata=df3_future)))]
## [1] 2.814818
df3_future[,mean(abs(y-predict(mod2,newdata=df3_future)))]
## [1] 2.416803
The MAEs are not similar!
Model 1 is much better than model 0.
Model 2 is much better than model 1.
Let’s repeat this with MSE.
This is a summary of the training error:
# compare predictions on todays data from these two models!
# compare MSE
df3[,mean((y-predict(mod0,newdata=df3))**2)]
## [1] 16.604
df3[,mean((y-predict(mod1,newdata=df3))**2)]
## [1] 12.80834
df3[,mean((y-predict(mod2,newdata=df3))**2)]
## [1] 9.253617
This is a summary of the generalization error:
# compare predictions on tomorrows data from these two models!
# compare MSE
df3_future[,mean((y-predict(mod0,newdata=df3_future))**2)]
## [1] 16.77168
df3_future[,mean((y-predict(mod1,newdata=df3_future))**2)]
## [1] 12.40227
df3_future[,mean((y-predict(mod2,newdata=df3_future))**2)]
## [1] 9.125094
The MSEs are not similar!
Model 1 is much better than model 0.
Model 2 is much better than model 1.
Why did we see this result? Because the underlying data generating process (reproduced below) was sufficiently complex that the additional capacity in model 2 gave better predictive power.
Recall the different DGP so far:
# visualize the relationship for DGP1
library(ggplot2)
ggplot(df1,aes(x1,y,group=x2, color=x2)) +
geom_point(alpha=0.1) +
geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
# visualize the relationship for DGP2
library(ggplot2)
ggplot(df2,aes(x1,y,group=x2, color=x2)) +
geom_point(alpha=0.1) +
geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
# visualize the relationship for DGP3
library(ggplot2)
ggplot(df3,aes(x1,y,group=x2, color=x2)) +
geom_point(alpha=0.1) +
geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Do you see the increasing complexity of the DGP?
Models with more capacity can give better predictive performance - if the underlying DGP has complexity that is not well captured by the current modeling approach.
But - increasing model capacity will not always improve predictive power!
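As a quick sketch of that point (reusing df1 and df1_future from the first DGP, where Y is independent of the predictors; the model names here are our own), we can hand the model far more capacity and watch the generalization error refuse to improve, because there is no signal to capture:

```r
# extra capacity on a no-signal DGP: generalization MSE should not improve
mod_flat <- lm(y ~ 1, data = df1)
mod_poly <- lm(y ~ poly(x1, 5) * x2, data = df1)
# compare generalization MSE on tomorrows data
df1_future[, mean((y - predict(mod_flat, newdata = df1_future))**2)]
df1_future[, mean((y - predict(mod_poly, newdata = df1_future))**2)]
```

We would expect these two MSE values to be nearly identical, since the extra polynomial and interaction terms can only fit noise here.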
I will now make the data generating process even more complex - then we will fit some models and summarize the performance.
# simulate some of todays data, continuous
set.seed(42)
n_obs <- 10000
# simulate x1 (continuous)
x1 <- rlnorm(n = n_obs,
meanlog = log(4),
sdlog = log(1.2))
# simulate x2 (binary)
x2 <- rbinom(n = n_obs,
size = 1,
prob = 0.3)
# simulate y, now conditional on X1 and X2
y_mean <-
8 +
ifelse(x1>4,
-4*x2,
3*x2) +
2*scale(x1) +
-4*scale(x1)*x2 +
-4*log(x1)*x2
y_sd <- 3
y <- rnorm(n = n_obs,
mean = y_mean,
sd = y_sd)
# make dataframe
df4 <- data.table(y,
x1,
x2)
# check it out
df4
As before, we simulate some of tomorrow’s data.
# simulate some of tomorrows data, continuous
set.seed(65)
n_obs <- 2000
# simulate x1 (continuous)
x1 <- rlnorm(n = n_obs,
meanlog = log(4),
sdlog = log(1.2))
# simulate x2 (binary)
x2 <- rbinom(n = n_obs,
size = 1,
prob = 0.3)
# simulate y, now conditional on X1 and X2
y_mean <-
8 +
ifelse(x1>4,
-4*x2,
3*x2) +
2*scale(x1) +
-4*scale(x1)*x2 +
-4*log(x1)*x2
y_sd <- 3
y <- rnorm(n = n_obs,
mean = y_mean,
sd = y_sd)
# make dataframe
df4_future <- data.table(y,
x1,
x2)
# check it out
df4_future
Let’s visualize the DGP:
# visualize the relationship
library(ggplot2)
ggplot(df4,aes(x1,y,group=x2, color=x2)) +
geom_point(alpha=0.1) +
geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Fit the same 3 models as before:
# try a linear model
mod0 <-
lm(y ~ 1,
data=df4)
# try a linear model conditional on x1 and x2
mod1 <-
lm(y ~ x1 + x2,
data=df4)
# try a linear model conditional on x1 and x2 with an interaction
mod2 <-
lm(y ~ x1 + x2 + x1:x2,
data=df4)
Which of these models performs the best?
Recall our primary interest…does it make better predictions…?
This is a summary of the training error:
# compare predictions on todays data from these two models!
# compare MAE
df4[,mean(abs(y-predict(mod0,newdata=df4)))]
## [1] 4.197568
df4[,mean(abs(y-predict(mod1,newdata=df4)))]
## [1] 3.817984
df4[,mean(abs(y-predict(mod2,newdata=df4)))]
## [1] 2.615975
This is a summary of the generalization error:
# compare predictions on tomorrows data from these two models!
# compare MAE
df4_future[,mean(abs(y-predict(mod0,newdata=df4_future)))]
## [1] 4.218995
df4_future[,mean(abs(y-predict(mod1,newdata=df4_future)))]
## [1] 3.838579
df4_future[,mean(abs(y-predict(mod2,newdata=df4_future)))]
## [1] 2.594531
The MAEs are not similar!
Model 1 is much better than model 0.
Model 2 is much better than model 1.
We can do the same comparison with MSE.
This is a summary of the training error:
# compare predictions on todays data from these two models!
# compare MSE
df4[,mean((y-predict(mod0,newdata=df4))**2)]
## [1] 30.38514
df4[,mean((y-predict(mod1,newdata=df4))**2)]
## [1] 22.68134
df4[,mean((y-predict(mod2,newdata=df4))**2)]
## [1] 10.77718
This is a summary of the generalization error:
# compare predictions on tomorrows data from these two models!
# compare MSE
df4_future[,mean((y-predict(mod0,newdata=df4_future))**2)]
## [1] 30.60306
df4_future[,mean((y-predict(mod1,newdata=df4_future))**2)]
## [1] 22.37999
df4_future[,mean((y-predict(mod2,newdata=df4_future))**2)]
## [1] 10.62482
The MSEs are not similar!
Model 1 is much better than model 0.
Model 2 is much better than model 1.
Model 2 gives us the best fit here. However, we know that model 2 is not the correct functional form of the true underlying data generating process. It is closer than models 0 and 1, but we should be able to do better than this.
The underlying challenge is that we do not know the data generating process in practice. We can look for systematic errors in our predictions conditional on our input variables to get a hint, but we never truly know if we have captured the underlying data generating process. We can only assess the relative performance of the different learning algorithms we have chosen to test.
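One hedged sketch of such a check, using mod2 and df4 from above (the residual column name y_resid is our own): plot the prediction errors against x1 within each level of x2 and look for leftover structure:

```r
# look for structure left in mod2's errors, conditional on x1 and x2
library(ggplot2)
df4[, y_resid := y - predict(mod2, newdata = df4)]
ggplot(df4, aes(x1, y_resid, group = x2, color = x2)) +
  geom_point(alpha = 0.1) +
  geom_smooth()
# a flat smooth near zero would suggest no obvious systematic error;
# curvature (which we expect here, given the step in this DGP)
# hints the functional form is still wrong
```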
Another key point is that we do not actually observe the future data! In all of the examples thus far we have simulated the future data so we could directly calculate the model performance on it. We do this to drive home the point that the goal of the machine learning process is to generate accurate predictions on that future unobserved data.
But if we do not observe that data, how can we assess performance on that data set?
We must somehow use our observed data, under the assumption that the observed data are a good representation of the unobserved data. Under that assumption, we can split the data into parts and keep some of those parts off to the side as stand-ins for the future data.
We start with an 80/20 split:
# split data into 80/20 split
set.seed(34)
train_ind <-
sample(1:nrow(df4),
size = nrow(df4)*.8)
df4_train <-
df4[train_ind]
df4_test <-
df4[-train_ind]
Check the dimensionality:
nrow(df4_train)
## [1] 8000
Check the dimensionality:
nrow(df4_test)
## [1] 2000
For brevity's sake, I only fit two of the models here.
Now we use the df4_train to fit models, and the df4_test to represent the future data.
# try a linear model conditional on x1 and x2
mod1 <-
lm(y ~ x1 + x2,
data=df4_train)
# try a linear model conditional on x1 and x2 with an interaction
mod2 <-
lm(y ~ x1 + x2 + x1:x2,
data=df4_train)
# compare predictions on hold out data from these two models!
# compare MSE
df4_test[,mean((y-predict(mod1,newdata=df4_test))**2)]
## [1] 22.85524
df4_test[,mean((y-predict(mod2,newdata=df4_test))**2)]
## [1] 10.97909
We are often interested in commenting on the expected performance of the chosen learning algorithm.
But if we have a simple binary split (testing/training), then we cannot calculate that performance. We have already used the testing data to select a learning algorithm - so the performance metric on that testing data is likely optimistic.
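A tiny simulation of our own can illustrate that optimism: if several equally good models produce noisy MSE estimates on the same test set, the "winning" score tends to understate the true error:

```r
# selection optimism: the minimum of several noisy scores is biased low
set.seed(7)
true_mse    <- 10
test_scores <- rnorm(10, mean = true_mse, sd = 0.5)  # 10 candidate models
min(test_scores)   # the selected winner typically looks better than true_mse
mean(test_scores)  # no selection: an unbiased view of true_mse
```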
Instead we should split our data into 3 chunks - training, validation, testing:
# split data into 60/20/20 split
# start with training data
set.seed(34)
train_ind <-
sample(1:nrow(df4),
size = nrow(df4)*.6)
df4_train <-
df4[train_ind]
# now split the rest in half for test and validate
test_ind <-
sample(1:nrow(df4[-train_ind]),
size = nrow(df4[-train_ind])*.5)
df4_test <-
df4[-train_ind][test_ind]
df4_validate <-
df4[-train_ind][-test_ind]
Check dimensionality:
nrow(df4_train)
## [1] 6000
Check dimensionality:
nrow(df4_validate)
## [1] 2000
Check dimensionality:
nrow(df4_test)
## [1] 2000
We now do the following:
# try a linear model conditional on x1 and x2
mod1 <-
lm(y ~ x1 + x2,
data=df4_train)
# try a linear model conditional on x1 and x2 with an interaction
mod2 <-
lm(y ~ x1 + x2 + x1:x2,
data=df4_train)
# compare predictions on hold out data from these two models!
# compare MSE
df4_validate[,mean((y-predict(mod1,newdata=df4_validate))**2)]
## [1] 23.17841
df4_validate[,mean((y-predict(mod2,newdata=df4_validate))**2)]
## [1] 10.95803
# pick model 2 as the winner, then estimate performance using the final dataset:
df4_test[,mean((y-predict(mod2,newdata=df4_test))**2)]
## [1] 10.91628
Can we do even better than the linear model with an interaction? We should be able to, since the DGP, with its step-function component, is more complex than that.
Recall this is the DGP visually:
# visualize the relationship
library(ggplot2)
ggplot(df4,aes(x1,y,group=x2, color=x2)) +
geom_point(alpha=0.1) +
geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Let’s say we are considering implementing the following learning algorithms and testing their predictive performance (we will go into more details on these later):
- glmnet: https://en.wikipedia.org/wiki/Elastic_net_regularization
- ranger: https://en.wikipedia.org/wiki/Random_forest
- xgboost: https://en.wikipedia.org/wiki/Gradient_boosting

How can we implement these different learning algorithms?
If we reviewed the ranger documentation, we would find this sort of code to fit the model:
# fit random forest with ranger
library(ranger)
mod_ranger <-
ranger(formula = y ~ x1 + x2,
data = df4_train,
num.trees=100,
sample.fraction = .6,
mtry=2)
We can make predictions using code like this:
## make predictions from ranger
predict(mod_ranger, df4_validate)$predictions[1:10]
## [1] 9.256515 10.763571 9.969048 6.345880 9.023969 8.132649 11.063618
## [8] 6.596976 9.142975 5.425633
We can add the ranger model into the comparison:
# compare predictions on tomorrows data from these two models!
# compare MSE
df4_validate[,mean((y-predict(mod0,newdata=df4_validate))**2)]
## [1] 31.54509
df4_validate[,mean((y-predict(mod1,newdata=df4_validate))**2)]
## [1] 23.17841
df4_validate[,mean((y-predict(mod2,newdata=df4_validate))**2)]
## [1] 10.95803
df4_validate[,mean((y-predict(mod_ranger, df4_validate)$predictions)**2)]
## [1] 11.21399
The ranger model came in second place, but is quite close to the linear model with an interaction.
But how do we fit the elastic net model, or the xgboost model? We would need to review more documentation…
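As a hedged sketch of what the glmnet version might look like (the matrix construction and the alpha value are our own choices): glmnet takes a numeric predictor matrix rather than a formula:

```r
# fit an elastic net with glmnet (expects matrix inputs)
library(glmnet)
x_train    <- as.matrix(df4_train[, .(x1, x2)])
x_validate <- as.matrix(df4_validate[, .(x1, x2)])
# cv.glmnet picks the penalty strength lambda by cross-validation;
# alpha = 0.5 mixes the ridge and lasso penalties
mod_glmnet  <- cv.glmnet(x_train, df4_train$y, alpha = 0.5)
pred_glmnet <- as.numeric(predict(mod_glmnet,
                                  newx = x_validate,
                                  s = "lambda.min"))
df4_validate[, mean((y - pred_glmnet)**2)]
```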
Also, we have written a lot of code here! Things we have done:

- simulated today's and tomorrow's data from several DGPs
- fit a series of candidate models
- made predictions and calculated training and generalization errors
- compared models with MAE and MSE
- split data into training, validation, and testing sets

And we wrote a lot of it by hand…
This is where the mlr3 package comes into play. mlr3 (among others) offers a unified framework for doing this sort of work. We do not have to worry about how to code all of this by hand, or about the differing interfaces of the underlying packages. Instead, we learn a single framework for conducting ML experiments.
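For example, the ranger comparison above might look roughly like this in mlr3 (a sketch, assuming the mlr3, mlr3learners, and ranger packages are installed; the names follow mlr3's task/learner/measure conventions):

```r
# the same ranger fit, expressed through the mlr3 framework
library(mlr3)
library(mlr3learners)
task    <- as_task_regr(df4_train[, .(y, x1, x2)],  # define the task
                        target = "y")
learner <- lrn("regr.ranger", num.trees = 100)      # pick an algorithm
learner$train(task)                                 # fit on training data
pred <- learner$predict_newdata(df4_validate)       # predict held-out data
pred$score(msr("regr.mse"))                         # summarize with MSE
```

Swapping in a different learning algorithm is then just a different lrn() call.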